Latency Analysis of Coded Computation Schemes over Wireless Networks
Large-scale distributed computing systems face two major bottlenecks that
limit their scalability: straggler delay caused by the variability of
computation times at different worker nodes and communication bottlenecks
caused by shuffling data across many nodes in the network. Recently, it has
been shown that codes can provide significant gains in overcoming these
bottlenecks. In particular, optimal coding schemes for minimizing latency in
distributed computation of linear functions and mitigating the effect of
stragglers were proposed for a wired network, where the workers can
simultaneously transmit messages to a master node without interference. In this
paper, we focus on the problem of coded computation over a wireless
master-worker setup with straggling workers, where only one worker can transmit
the result of its local computation back to the master at a time. We consider
three asymptotic regimes (determined by how the communication and computation
times scale with the number of workers) and precisely characterize the total
run-time of the distributed algorithm and optimum coding strategy in each
regime. In particular, for the regime of practical interest where the
computation and communication times of the distributed computing algorithm are
comparable, we show that the total run-time approaches a simple lower bound
that decouples computation and communication, and demonstrate that coded
schemes are $\Theta(\log n)$ times faster than uncoded schemes.
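To make the coding idea concrete, the following toy sketch illustrates generic MDS-style coded computation of a matrix-vector product (this is an illustrative construction with a random generator matrix, not the paper's wireless scheme; all sizes and names are hypothetical): the master recovers $A x$ from any $k$ of $n$ coded worker results, so it never waits for stragglers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                       # 5 workers, any 3 results suffice
A = rng.standard_normal((6, 4))   # 6 rows, split into k = 3 row-blocks
x = rng.standard_normal(4)

# Encode: n coded row-blocks from k uncoded ones via generator G.
blocks = np.split(A, k)                          # each block: (2, 4)
G = rng.standard_normal((n, k))                  # random generator, MDS w.h.p.
coded_blocks = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes its coded product; suppose workers 1 and 4 straggle.
worker_out = {i: coded_blocks[i] @ x for i in range(n)}
arrived = [0, 2, 3]                              # first k results to arrive

# Decode: solve G_k @ B = Y for the uncoded block products B.
Gk = G[arrived, :]                               # (k, k), invertible w.h.p.
Y = np.stack([worker_out[i] for i in arrived])   # (k, 2)
B = np.linalg.solve(Gk, Y)                       # recovered block products
Ax = B.reshape(-1)                               # concatenate -> A @ x

assert np.allclose(Ax, A @ x)
```

In the wireless setting studied in the paper, the decoding threshold additionally interacts with the one-at-a-time communication constraint, which is what the regime analysis captures.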
EM for Mixture of Linear Regression with Clustered Data
Modern data-driven and distributed learning frameworks deal with diverse
massive data generated by clients spread across heterogeneous environments.
Indeed, data heterogeneity is a major bottleneck in scaling up many distributed
learning paradigms. In many settings, however, heterogeneous data may be
generated in clusters with shared structures, as is the case in several
applications such as federated learning where a common latent variable governs
the distribution of all the samples generated by a client. It is therefore
natural to ask how the underlying clustered structures in distributed data can
be exploited to improve learning schemes. In this paper, we tackle this
question in the special case of estimating $d$-dimensional parameters of a
two-component mixture of linear regressions problem where each of $m$ nodes
generates $n$ samples with a shared latent variable. We employ the well-known
Expectation-Maximization (EM) method to estimate the maximum likelihood
parameters from $m$ batches of dependent samples, each containing $n$
measurements. Discarding the clustered structure in the mixture model, EM is
known to require $\mathcal{O}(\log(mn/d))$ iterations to reach the statistical
accuracy of $\mathcal{O}(\sqrt{d/(mn)})$. In contrast, we show that if
initialized properly, EM on the structured data requires only $\mathcal{O}(1)$
iterations to reach the same statistical accuracy, as long as $m$ grows as
$e^{o(n)}$. Our analysis
establishes and combines novel asymptotic optimization and generalization
guarantees for population and empirical EM with dependent samples, which may be
of independent interest.
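The acceleration from clustering can be sketched on synthetic data: every measurement in a batch shares one latent sign, so the E-step pools the whole batch when computing the posterior, which makes the soft labels nearly exact after very few iterations. Below is a minimal illustration for a symmetric two-component mixture with known noise level (all dimensions and constants are hypothetical choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, sigma = 3, 200, 10, 0.5       # dim, batches, samples/batch, noise
beta_true = np.array([1.0, -2.0, 0.5])

# Batch i shares one latent sign z_i: y_it = z_i * <x_it, beta> + noise.
Z = rng.choice([-1.0, 1.0], size=m)
X = rng.standard_normal((m, n, d))
Y = Z[:, None] * (X @ beta_true) + sigma * rng.standard_normal((m, n))

def em_step(beta):
    # E-step: posterior of z_i = +1 pools ALL n measurements of batch i.
    s = (Y * (X @ beta)).sum(axis=1) / sigma**2
    w = 0.5 * (1.0 + np.tanh(s))        # = sigmoid(2s), numerically stable
    # M-step: least squares against soft batch labels 2w - 1 in [-1, 1].
    t = (2.0 * w - 1.0)[:, None]
    Xf = X.reshape(-1, d)
    yf = (t * Y).reshape(-1)
    return np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)

beta = beta_true + 0.5 * rng.standard_normal(d)   # proper initialization
for _ in range(5):
    beta = em_step(beta)

# EM recovers beta_true (up to the global sign ambiguity) very quickly.
err = min(np.linalg.norm(beta - beta_true), np.linalg.norm(beta + beta_true))
assert err < 0.1
```

Because each posterior aggregates a whole batch, the soft labels concentrate exponentially fast in the batch size, which is the intuition behind the constant iteration count.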
Robust and Communication-Efficient Collaborative Learning
We consider a decentralized learning problem, where a set of computing nodes
aim at solving a non-convex optimization problem collaboratively. It is
well-known that decentralized optimization schemes face two major system
bottlenecks: stragglers' delay and communication overhead. In this paper, we
tackle these bottlenecks by proposing a novel decentralized and gradient-based
optimization algorithm named QuanTimed-DSGD. Our algorithm stands on two
main ideas: (i) we impose a deadline on the local gradient computations of each
node at each iteration of the algorithm, and (ii) the nodes exchange quantized
versions of their local models. The first idea robustifies the algorithm
against straggling nodes, and the second alleviates the communication
overhead. The key technical
contribution of our work is to prove that with non-vanishing noises for
quantization and stochastic gradients, the proposed method exactly converges to
the global optimum for convex loss functions, and finds a first-order
stationary point in non-convex scenarios. Our numerical evaluations of the
QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate
speedups of up to 3x in run-time, compared to state-of-the-art decentralized
optimization methods.
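The two ideas can be simulated on toy quadratic losses: each node mixes the quantized models of its ring neighbors, and a deadline limits how many stochastic gradient samples each node finishes per iteration. This is an illustrative sketch of the mechanism, not the exact QuanTimed-DSGD update or its parameter choices; all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, delta=0.05):
    """Unbiased stochastic rounding to a grid with resolution delta."""
    low = np.floor(x / delta) * delta
    p = (x - low) / delta                 # probability of rounding up
    return low + delta * (rng.random(x.shape) < p)

# 4 nodes on a ring; node i holds f_i(x) = 0.5 * ||x - c_i||^2, so the
# global minimizer is the average of the c_i.
n_nodes, d = 4, 3
C = rng.standard_normal((n_nodes, d))
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])  # doubly stochastic ring mixing

x = np.tile(rng.standard_normal(d), (n_nodes, 1))
alpha, eps = 0.5, 0.02                    # consensus and gradient step sizes
for t in range(1000):
    mixed = W @ quantize(x)               # exchange QUANTIZED models
    # Deadline: node i only finishes k_i gradient samples in time; fewer
    # samples means a noisier (but still unbiased) gradient estimate.
    k = rng.integers(1, 6, size=n_nodes)
    noise = rng.standard_normal((n_nodes, d)) / np.sqrt(k)[:, None]
    grads = (x - C) + noise
    x = x + alpha * (mixed - x) - eps * grads

# Despite quantization and deadline noise, nodes approach the optimum.
assert np.linalg.norm(x.mean(axis=0) - C.mean(axis=0)) < 0.3
```

The stochastic rounding keeps the quantization error zero-mean, which is the property the exact-convergence analysis relies on.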
Variance-reduced Clipping for Non-convex Optimization
Gradient clipping is a standard training technique used in deep learning
applications such as large-scale language modeling to mitigate exploding
gradients. Recent experimental studies have demonstrated a fairly special
behavior in the smoothness of the training objective along its trajectory when
trained with gradient clipping. That is, the smoothness grows with the gradient
norm. This is in clear contrast to the well-established assumption in folklore
non-convex optimization, a.k.a. $L$-smoothness, where the smoothness is
assumed to be bounded by a constant globally. The recently introduced
$(L_0, L_1)$-smoothness is a more relaxed notion that captures such behavior in
non-convex optimization. In particular, it has been shown that under this
relaxed smoothness assumption, SGD with clipping requires
$\mathcal{O}(\epsilon^{-4})$ stochastic gradient computations to find an
$\epsilon$-stationary solution. In this paper, we employ a variance reduction
technique, namely SPIDER, and demonstrate that for a carefully designed
learning rate, this complexity is improved to $\mathcal{O}(\epsilon^{-3})$,
which is order-optimal. Our designed learning rate incorporates the clipping
technique to mitigate the growing smoothness. Moreover, when the objective
function is the average of $n$ components, we improve the best known bound on
the stochastic gradient complexity to $\mathcal{O}(\sqrt{n}\epsilon^{-2})$,
which is order-optimal as well.
In addition to being theoretically optimal, SPIDER with our designed parameters
demonstrates comparable empirical performance against variance-reduced methods
such as SVRG and SARAH in several vision tasks.
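The mechanics of SPIDER with a clipped learning rate can be sketched on a toy finite sum. The quadratic components below only exercise the estimator and the step rule, not the relaxed-smoothness theory; epoch length, batch size, and step constants are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 64, 5
A = rng.standard_normal((n, d))   # component f_i(x) = 0.5 * ||x - a_i||^2

def grad(x, idx):
    """Average gradient of the sampled components f_i."""
    return x - A[idx].mean(axis=0)

q, b, eta, gamma = 8, 8, 0.5, 1.0   # epoch length, batch, lr cap, clip level
x = rng.standard_normal(d)
x_prev = x.copy()
v = grad(x, np.arange(n))
for t in range(200):
    if t % q == 0:
        v = grad(x, np.arange(n))                 # periodic full gradient
    else:
        idx = rng.integers(0, n, size=b)
        v = v + grad(x, idx) - grad(x_prev, idx)  # SPIDER recursion
    x_prev = x
    # Clipped learning rate: plain eta when the estimate is small,
    # normalized steps of length gamma once ||v|| grows.
    step = min(eta, gamma / (np.linalg.norm(v) + 1e-12))
    x = x - step * v

# The final iterate is (near-)stationary: the full gradient is tiny.
assert np.linalg.norm(grad(x, np.arange(n))) < 1e-3
```

The recursion reuses the same sampled indices at consecutive iterates, so the correction term has low variance when the iterates move slowly, which is where the improved complexity comes from.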
Robust Federated Learning: The Case of Affine Distribution Shifts
Federated learning is a distributed paradigm that aims at training models
using samples distributed across multiple users in a network while keeping the
samples on users' devices with the aims of efficiency and protecting users'
privacy. In such settings, the training data is often statistically
heterogeneous and manifests various distribution shifts across users, which
degrades the performance of the learnt model. The primary goal of this paper is
to develop a robust federated learning algorithm that achieves satisfactory
performance against distribution shifts in users' samples. To achieve this
goal, we first consider a structured affine distribution shift in users' data
that captures the device-dependent data heterogeneity in federated settings.
This perturbation model is applicable to various federated learning problems
such as image classification where the images undergo device-dependent
imperfections, e.g. different intensity, contrast, and brightness. To address
affine distribution shifts across users, we propose a Federated Learning
framework Robust to Affine distribution shifts (FLRA) that is provably robust
against affine Wasserstein shifts to the distribution of observed samples. To
solve the FLRA's distributed minimax problem, we propose a fast and efficient
optimization method and provide convergence guarantees via a Gradient Descent
Ascent (GDA) method. We further prove generalization error bounds for the
learnt classifier to show proper generalization from the empirical distribution
samples to the true underlying distribution. We perform several numerical
experiments to empirically support FLRA. We show that an affine distribution
shift indeed suffices to significantly decrease the performance of the learnt
classifier in a new test user, and our proposed algorithm achieves a
significant gain in comparison to standard federated learning and adversarial
training methods.
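The minimax formulation can be made concrete with a toy gradient descent ascent loop: a global linear model plays against per-user affine perturbations $x \mapsto \Lambda_u x + \delta_u$, with a quadratic penalty standing in for the Wasserstein-style constraint. This is an illustrative sketch, not the FLRA algorithm itself; all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d = 5, 40, 3                       # users, samples per user, dimension
theta_true = np.array([0.5, -0.5, 0.5])
X = rng.standard_normal((m, n, d))
Y = X @ theta_true + 0.1 * rng.standard_normal((m, n))

lam = 5.0                                # penalty keeping shifts small
theta = np.zeros(d)                      # global model (minimizing player)
Lmb = np.tile(np.eye(d), (m, 1, 1))      # per-user shift x -> Lmb @ x + delta
delta = np.zeros((m, d))

eta_min, eta_max = 0.05, 0.02            # descent / ascent step sizes
for t in range(800):
    Xs = np.einsum('uij,utj->uti', Lmb, X) + delta[:, None, :]  # shifted inputs
    r = Xs @ theta - Y                                          # residuals (m, n)
    # Descent step on the model parameters.
    g_theta = np.einsum('ut,uti->i', r, Xs) / (m * n)
    theta = theta - eta_min * g_theta
    # Ascent step on each user's affine perturbation, penalized toward
    # the identity map (Lmb = I, delta = 0).
    g_L = np.einsum('ut,i,utj->uij', r, theta, X) / n - 2 * lam * (Lmb - np.eye(d))
    g_d = r.mean(axis=1)[:, None] * theta - 2 * lam * delta
    Lmb = Lmb + eta_max * g_L
    delta = delta + eta_max * g_d

# The robust model stays close to the true parameters on clean data.
assert np.linalg.norm(theta - theta_true) < 0.3
```

With a sufficiently large penalty the inner maximization is strongly concave, which is the regime in which descent-ascent schemes for such minimax problems admit convergence guarantees.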
Robust and Efficient Algorithms for Federated Learning and Distributed Computing
Training a large-scale model over a massive data set is an extremely computation- and storage-intensive task, e.g. training ResNet, with hundreds of millions of parameters, over the data set ImageNet, with millions of images. As a result, there has been significant interest in developing distributed learning strategies that speed up the training of learning models. Due to the growing computational power of the ecosystem of billions of mobile and computing devices, many future distributed learning systems will operate by storing data locally and pushing computation to the network edge. Unlike traditional centralized machine learning environments, however, machine learning at the edge is characterized by significant challenges, including (1) scalability, due to severe constraints on communication bandwidth and other resources including storage and energy, (2) robustness to stragglers and edge failures due to slow edge nodes, and (3) generalization of models to non-i.i.d. and heterogeneous data.

In this thesis, we focus on two important distributed learning frameworks, Federated Learning and Distributed Computing, with a shared goal in mind: how to provably address the critical challenges in such paradigms using novel techniques from distributed optimization, statistical learning theory, probability theory, and communication and coding theory, to advance the state of the art in efficiency, resiliency, and scalability. In the first part of the thesis, we devise three methods to mitigate communication cost, improve straggler resiliency, and achieve robustness to heterogeneous data in federated learning paradigms. Our main ideas are to employ model compression, adaptive device participation, and distributionally robust minimax optimization, respectively, for these challenges.
We characterize provable improvements for the proposed algorithms in terms of convergence speed, expected runtime, and generalization gaps.

Moving on to the second part, we consider important instances of distributed computing frameworks, such as distributed gradient aggregation, matrix-vector multiplication, and MapReduce-type computing tasks, and propose several algorithms to mitigate the aforementioned bottlenecks in these settings. The key idea in our designs is to introduce redundant and coded computation in an elaborate fashion in order to reduce the communication cost and the total runtime. In both parts, we also support our theoretical results with numerical experiments that demonstrate significant improvements.